import cv2 as cv
import matplotlib.pyplot as plt
import pandas as pd
URL to the blog post:
https://jamie1130.github.io/PIC-16B/posts/Final Project/
Link to the git repo:
https://github.com/torwar02/trails
Project Overview
Using the internet, it’s easy to find information about hiking trails throughout the United States, but there’s a catch: sites like trailforks.com make it easy to browse lists of trails and filter by location, but only if you already know where you want to go. What are you supposed to do if you don’t know much about an area but still want to go hiking, or if you want certain things out of your hike but don’t know where to find them? You could do the research yourself, reading articles and sifting through different locations, but that process can be tedious, especially for somewhere you’ve never been.
Our project helps users find new places to visit based on trails they have previously hiked and what they want out of their next hike (which the user describes). For instance, if the user previously hiked Half Dome in Yosemite, we recommend the trails and locations in the United States most similar to Half Dome according to reviews on TripAdvisor. This is aimed mainly at tourists who want to visit parts of the country they have not been to.
Let’s have a quick visual overview of the structure of the project.

Webscraping Trip Advisor - Tyler
Here we webscrape TripAdvisor to gather reviews for semantic analysis, so that we can recommend trails based upon how similar their reviews are to each other. We do so using scrapy!
This is what the first page looks like:
img = cv.imread('/content/tylercap/Capture.JPG')
plt.imshow(img)
The first function we write is parse. This function redirects the scraper from the general national park page to the page listing just the activities.
def parse(self, response):
    next_page = response.xpath("//div//a[contains(@href, 'Attraction')]//@href").get() #xpath command to redirect to activities
    yield response.follow(next_page, callback = self.parse_full_credits) #follows the new url and executes parse_full_credits
The second function we write is parse_full_credits. This function redirects the scraper from the list of all activities to each individual trail page. We do this so that we can extract reviews and information for individual trails in each national park.
This is what the second page looks like:
img = cv.imread('/content/tylercap/Capture2.JPG')
plt.imshow(img)
def parse_full_credits(self, response):
    trail_page = response.xpath("//div[@class = 'BYvbL A']//a[@class = 'BUupS _R w _Z y M0 B0 Gm wSSLS']//@href").getall()
    for trail in trail_page: #for every trail url in the trail page list, execute the callback parse_trail
        yield response.follow(trail, callback = self.parse_trail)
Finally, the last function we write is parse_trail. This function outputs important data such as the national park name, the state the park is in, the trail name, the overall trail rating, and each comment's title, text, and individual rating.
def parse_trail(self, response):
    national_park = response.xpath("//span[@class = 'fxMOE']//text()").get()
    state = response.xpath("//span[@class = 'n q']//span[@class = 'biGQs _P pZUbB avBIb osNWb']//text()").getall()[1]
    trail = response.xpath("//h1//text()").get()
    comment_title = response.xpath("//div[@class = 'LbPSX']//div[@class = 'biGQs _P fiohW qWPrE ncFvv fOtGX']//span//text()").getall()
    ratings = response.xpath("//div[@class = 'LbPSX']//svg[@class = 'UctUV d H0']//title//text()").getall()
    comment_text = response.xpath("//div[@class = 'LbPSX']//span[@class = 'JguWG']//span[@class = 'yCeTE']//text()[1]").getall()
    pictures = response.xpath("//div[@class = 'LbPSX']//span[@class = 'biGQs _P XWJSj Wb']//img//@srcset").getall()
    overall_rating = response.xpath("//div[@class = 'biGQs _P fiohW hzzSG uuBRH']//text()").get()
    trail_type = response.xpath("//div[@class = 'biGQs _P pZUbB alXOW oCpZu GzNcM nvOhm UTQMg ZTpaU W KxBGd']//span//text()").get()
    for ix in range(len(comment_title)):
        yield {
            "national_park" : national_park,
            "state" : state,
            "trail": trail,
            "activity": trail_type,
            "overall_rating": overall_rating,
            "comment_title": comment_title[ix],
            "comment_ratings": ratings[ix],
            "comment_text": comment_text[ix]
        }
This is what the trail page with reviews looks like:
img = cv.imread('/content/tylercap/Capture3.JPG')
plt.imshow(img)
Now, we want to get this data in csv format. To do so, we go to the directory which holds our spider and run scrapy crawl trip_advisor -o national_parks.csv. Great, now we can analyze our reviews!
Review Similarity Trail Recommender - Tyler
Now, we utilize functions and word embedding to return the most similar trails and their location in the United States based upon the csv file we just created from our webscraper!
Firstly, let us import the packages we need. en_core_web_lg is a large (roughly 590 MB) English model from spaCy with 514 thousand unique word vectors, each reduced to 300 dimensions. We use it so we can take advantage of built-in word vectors instead of training our own.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import gensim
import spacy
!python -m spacy download en_core_web_lg
Successfully installed en-core-web-lg-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
nlp = spacy.load('en_core_web_lg')
Load Data
Now let us load the data we scraped from TripAdvisor, as well as an Excel file containing the coordinates of our national parks, so that we can create a geographical plot later.
df = pd.read_csv('https://raw.githubusercontent.com/torwar02/trails/main/trails/national_parks.csv')
df2 = pd.read_excel('https://raw.githubusercontent.com/torwar02/trails/main/trails/coords.xlsx')
df.head()
|   | national_park | state | trail | activity | overall_rating | comment_title | comment_ratings | comment_text |
|---|---|---|---|---|---|---|---|---|
| 0 | Acadia National Park | Maine (ME) | Beech Mountain Trail | Hiking Trails | 4.5 | Turned back on 3/20/21 due to ice | 4.0 of 5 bubbles | I have hiked to the fire tower a few times. It... |
| 1 | Acadia National Park | Maine (ME) | Beech Mountain Trail | Hiking Trails | 4.5 | Spectacular | 5.0 of 5 bubbles | This trail was recommended in my Acadia travel... |
| 2 | Acadia National Park | Maine (ME) | Beech Mountain Trail | Hiking Trails | 4.5 | Great Trail | 5.0 of 5 bubbles | Beech Mountain Trail is one of my favorites in... |
| 3 | Acadia National Park | Maine (ME) | Beech Mountain Trail | Hiking Trails | 4.5 | Best trail in Acadia | 5.0 of 5 bubbles | We stumbled onto this trail and were very happ... |
| 4 | Acadia National Park | Maine (ME) | Beech Mountain Trail | Hiking Trails | 4.5 | Great trail for family | 5.0 of 5 bubbles | My family has kids ranging from age 10 to 3. W... |
df2.head()
|   | Latitude | Longitude | Park | State(s) | Park Established | Area | Visitors (2018) |
|---|---|---|---|---|---|---|---|
| 0 | 44.35 | -68.21 | Acadia | Maine | February 26, 1919 | 49,075.26 acres (198.6 km2) | 3537575 |
| 1 | -14.25 | -170.68 | American Samoa | American Samoa | October 31, 1988 | 8,256.67 acres (33.4 km2) | 28626 |
| 2 | 38.68 | -109.57 | Arches | Utah | November 12, 1971 | 76,678.98 acres (310.3 km2) | 1663557 |
| 3 | 43.75 | -102.50 | Badlands | South Dakota | November 10, 1978 | 242,755.94 acres (982.4 km2) | 1008942 |
| 4 | 29.25 | -103.25 | Big Bend | Texas | June 12, 1944 | 801,163.21 acres (3,242.2 km2) | 440091 |
To merge the two files together, we use regex: we extract the string preceding ‘National Park’ in df so that we can merge with df2 on the park name.
import re
pattern = r'(.*?)(?:\s+National Park)?$' # capture everything before an optional ' National Park' suffix
result = re.findall(pattern, df['national_park'].iloc[0]) # quick sanity check on the first row
park = []
for row in df['national_park']:
    test_park = re.findall(pattern, row)
    park.append(test_park[0])
df['park'] = park
national_parks = pd.merge(df, df2, left_on='park', right_on='Park')
national_parks = national_parks.drop(columns = ['park', 'Park', 'State(s)', 'Park Established'])
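As a side note, the loop above can also be written in one shot with pandas’ vectorized string methods. Here is a minimal sketch using the same pattern on a toy frame (the toy data is illustrative, not our scraped CSV):

```python
import pandas as pd

pattern = r'(.*?)(?:\s+National Park)?$'

# toy frame standing in for the scraped data
df_demo = pd.DataFrame({'national_park': ['Acadia National Park', 'Arches National Park']})

# str.extract applies the regex to every row and keeps the first capture group
df_demo['park'] = df_demo['national_park'].str.extract(pattern)[0]
print(df_demo['park'].tolist())  # ['Acadia', 'Arches']
```

This avoids the explicit Python loop and keeps the transformation inside pandas.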
national_parks.head()
|   | national_park | state | trail | activity | overall_rating | comment_title | comment_ratings | comment_text | Latitude | Longitude | Area | Visitors (2018) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Acadia National Park | Maine (ME) | Beech Mountain Trail | Hiking Trails | 4.5 | Turned back on 3/20/21 due to ice | 4.0 of 5 bubbles | I have hiked to the fire tower a few times. It... | 44.35 | -68.21 | 49,075.26 acres (198.6 km2) | 3537575 |
| 1 | Acadia National Park | Maine (ME) | Beech Mountain Trail | Hiking Trails | 4.5 | Spectacular | 5.0 of 5 bubbles | This trail was recommended in my Acadia travel... | 44.35 | -68.21 | 49,075.26 acres (198.6 km2) | 3537575 |
| 2 | Acadia National Park | Maine (ME) | Beech Mountain Trail | Hiking Trails | 4.5 | Great Trail | 5.0 of 5 bubbles | Beech Mountain Trail is one of my favorites in... | 44.35 | -68.21 | 49,075.26 acres (198.6 km2) | 3537575 |
| 3 | Acadia National Park | Maine (ME) | Beech Mountain Trail | Hiking Trails | 4.5 | Best trail in Acadia | 5.0 of 5 bubbles | We stumbled onto this trail and were very happ... | 44.35 | -68.21 | 49,075.26 acres (198.6 km2) | 3537575 |
| 4 | Acadia National Park | Maine (ME) | Beech Mountain Trail | Hiking Trails | 4.5 | Great trail for family | 5.0 of 5 bubbles | My family has kids ranging from age 10 to 3. W... | 44.35 | -68.21 | 49,075.26 acres (198.6 km2) | 3537575 |
Word Embedding and Comment Similarity Score
First let us go over what word embedding is. Word embedding is an important NLP technique for representing words as real-valued vectors for text analysis. In this approach, words and documents are represented as numeric vectors such that similar words have similar vector representations. These vectors preserve semantic and syntactic information, and they can be fed into machine learning models and NLP algorithms that easily digest the learned representations to process textual information.
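To make the idea concrete, similarity between embeddings is typically measured with cosine similarity. Here is a minimal sketch using made-up 4-dimensional vectors (real spaCy vectors are 300-dimensional):

```python
import numpy as np

def cosine_similarity(u, v):
    # cosine of the angle between two embedding vectors: 1 = same direction, 0 = unrelated
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# toy embeddings: "hike" and "trek" point in similar directions, "bank" does not
hike = np.array([0.9, 0.1, 0.0, 0.3])
trek = np.array([0.8, 0.2, 0.1, 0.4])
bank = np.array([0.0, 0.9, 0.8, 0.0])

print(cosine_similarity(hike, trek) > cosine_similarity(hike, bank))  # True
```

spaCy’s `doc.similarity` uses essentially this computation on averaged document vectors.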
Total Trail Similarity
Now let us create a function called total_similarity, which takes the same parameters as our last function except that it takes a trail name instead of comment_index. We do so because we want all 10 comments per trail. total_similarity calls comment_similarity to find the most similar comment to each of the trail's 10 comments; as a result, we get 10 similar trails returned to us.
def total_similarity(trail, parks_data, all_comments):
    trail_subset = parks_data[parks_data['trail'] == trail].index
    total_df = []
    for number in trail_subset:
        total_df.append(comment_similarity(parks_data, number, all_comments)) # use the passed-in arguments rather than globals
    df = pd.concat(total_df)
    return df
output = total_similarity("Landscape Arch", national_parks, all_docs)
output
|   | national_park | state | trail | activity | overall_rating | comment_title | comment_ratings | comment_text | Latitude | Longitude | Area | Visitors (2018) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 303 | Badlands National Park | South Dakota (SD) | Pinnacles Overlook | Points of Interest & Landmarks | 5.0 | Must See Pullover | 5.0 of 5 bubbles | This is one of a handful of overlooks you have... | 43.75 | -102.50 | 242,755.94 acres (982.4 km2) | 1008942 |
| 235 | Arches National Park | Utah (UT) | Delicate Arch | Points of Interest & Landmarks | 5.0 | Delicate Arch | 5.0 of 5 bubbles | Our family chose to hike to Delicate Arch late... | 38.68 | -109.57 | 76,678.98 acres (310.3 km2) | 1663557 |
| 863 | Capitol Reef National Park | Utah (UT) | Capitol Reef National Park | National Parks | 4.5 | Add Capitol Reef to Your Utah National Park List | 5.0 of 5 bubbles | Just to the northeast of more popular parks Br... | 38.20 | -111.17 | 241,904.50 acres (979.0 km2) | 1227627 |
| 1310 | Death Valley National Park | California (CA) | Zabriskie Point | Geologic Formations | 4.5 | The Most Iconic Place in Death Valley | 4.0 of 5 bubbles | You can't miss it. I don't mean you have to do... | 36.24 | -116.82 | 3,373,063.14 acres (13,650.3 km2) | 1678660 |
| 1611 | Grand Teton National Park | Wyoming (WY) | Taggart Lake | Hiking Trails | 5.0 | Do this hike if you want to feel like you're a... | 5.0 of 5 bubbles | It's not a difficult hike and is right off the... | 43.73 | -110.80 | 310,044.22 acres (1,254.7 km2) | 3491151 |
| 222 | Arches National Park | Utah (UT) | Double Arch | Hiking Trails | 5.0 | Easy hike | 5.0 of 5 bubbles | The Double Arch is unreal. It is massive and b... | 38.68 | -109.57 | 76,678.98 acres (310.3 km2) | 1663557 |
| 3198 | Mount Rainier National Park | Washington (WA) | Sunrise Visitor Center | Visitor Centers | 4.5 | Amazing views | 5.0 of 5 bubbles | Amazing hikes of all varieties. Many travel up... | 46.85 | -121.75 | 236,381.64 acres (956.6 km2) | 1518491 |
| 1439 | Glacier National Park | Montana (MT) | Grinnell Glacier | Hiking Trails | 5.0 | Incredible vies and the end-point is rewarding | 5.0 of 5 bubbles | This 13 mile hike from Many Glacier to upper G... | 48.80 | -114.00 | 1,013,125.99 acres (4,100.0 km2) | 2965309 |
| 1366 | Glacier National Park | Montana (MT) | Virginia Falls | Waterfalls | 5.0 | Magnificent Falls in Glacier National Park - w... | 5.0 of 5 bubbles | This is the second falls on a hike in Glacier ... | 48.80 | -114.00 | 1,013,125.99 acres (4,100.0 km2) | 2965309 |
| 650 | Canyonlands National Park | Utah (UT) | Horseshoe Canyon | Canyons | 5.0 | WHOA! READ PLEASE. Things you NEED to know a... | 5.0 of 5 bubbles | There are some older reviews. Some are VERY M... | 38.20 | -109.93 | 337,597.83 acres (1,366.2 km2) | 739449 |
As we can see, we get the 10 trails most similar to our desired trail, Landscape Arch.
Plotly Function
Now let us construct a geographical plot function called plotting_parks to get the location of these trails on a map. This is so that the user can better visualize where in the United States they may have to travel to. The function also analyzes other metrics from national_parks.csv such as visitors in 2018, type of activity, trail name, and overall TripAdvisor rating. This function calls total_similarity in order to get the dataframe with the most similar reviews!
from plotly import express as px
import plotly.io as pio
pio.renderers.default = "iframe"
def plotting_parks(trail, parks_data, all_comments, **kwargs):
    output = total_similarity(trail, parks_data, all_comments)
    fig = px.scatter_mapbox(output, lon = "Longitude", lat = "Latitude", color = "overall_rating",
                            color_continuous_midpoint = 2.5, hover_name = "national_park", height = 600,
                            hover_data = ["Visitors (2018)", "activity", "trail", "overall_rating"],
                            title = "Recommended National Park Trails",
                            size_max = 50,
                            **kwargs,
                            )
    return fig
color_map = px.colors.diverging.RdGy_r # produce a color map
fig = plotting_parks("Landscape Arch", national_parks, all_docs, mapbox_style="carto-positron",
                     color_continuous_scale = color_map)
fig.show()
Great: as we can see, we get a geographical plot of the National Park trails in the United States most similar to Landscape Arch!
Webscraping TrailForks - Zion
Our original plan was to use a website called AllTrails, which contains very comprehensive information about different hiking trails. However, they beefed up their security measures several years ago to prevent people from scraping their website. Because of this, we turned to a different site called TrailForks that, while still able to block scrapy, is unable to block Selenium.
How do you get started with Selenium?
Selenium is able to evade certain anti-bot measures by actually using an instance of a web browser (controlled through a webdriver) that runs on your system while scraping. In fact, once you get the scraper working, you can watch it run in real time! Unfortunately, that makes it a lot slower than scrapy, since your computer actually has to open every page. I used a Google Chrome webdriver. Below is (part of) the head of scraper.py, which scrapes data from individual trails on TrailForks.
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import pandas as pd
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option(
    "prefs", {
        # block image and javascript loading
        "profile.managed_default_content_settings.images": 2,
        "profile.managed_default_content_settings.javascript": 2
    }
)
options.add_argument('--no-sandbox')
options.add_argument('--headless') # arguments must be added before the driver is constructed
service = Service()
driver = webdriver.Chrome(
    service=service,
    options=options
)
Of note are the experimental options under prefs which block both images and javascript content from loading on a website. When I first made the scraper, I did not have these enabled, and as a result, pages would sometimes take between 3 and 5 seconds to load (way too long!). Similarly, the --headless argument also makes pages load faster by disabling certain Google Chrome functionalities: https://www.selenium.dev/blog/2023/headless-is-going-away/
How do you scrape using Selenium?
At its core, Selenium isn’t that different from scrapy in that you can have a scraper download HTML code which you can then filter through in Python as more familiar objects. Both scraper.py and scraper_parks.py (which filters through park-related information on TrailForks) follow the same general principles:
1. Look at what state a user has inputted
2. Get links to all of the park/trail pages for that state
3. Get corresponding information from each page, using helper functions if need be
4. Add data to a SQL database (see the next section for that!)
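Step 4 is covered in the next section, but here is a rough sketch of the idea using Python's built-in sqlite3 module. The table name and columns below are simplified assumptions for illustration, not the exact schema of trails.db:

```python
import sqlite3

# hedged sketch: a simplified stand-in for database_info.add_trails
conn = sqlite3.connect(":memory:")  # the real project writes to trails.db on disk
conn.execute("CREATE TABLE IF NOT EXISTS trails (name TEXT, state TEXT, distance TEXT)")

def add_trail(name, state, distance):
    # parameterized insert avoids SQL-injection issues with scraped text
    conn.execute("INSERT INTO trails VALUES (?, ?, ?)", (name, state, distance))
    conn.commit()

add_trail("Mist Trail", "california", "3 miles")
rows = conn.execute("SELECT * FROM trails").fetchall()
print(rows)  # [('Mist Trail', 'california', '3 miles')]
```

In the real scraper, the dictionaries gathered per trail are combined into a pandas DataFrame first and then written to the database.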
scraper.py
As an example, let’s take a look at the function state_scraper in scraper.py. This is the outermost scraping function and takes in a state that the user inputs. However, for our project, we just used it to grab trails from California, since it would’ve taken a very long time to do this for the entire country, given that TrailForks has nearly 300,000 trails registered in the US alone.
def state_scraper(state_name):
    """
    This is the primary scraping function used inside of `main`.
    Inputs:
    ``state_name``: The name of the state in question.
    The function compiles ``href_list``, a list of the current page's trail urls, and
    iterates through each url. Using selenium's Chrome driver, it gets the `url` and
    calls the 3 individual scraping functions: `scrape_basic_stats`, `scrape_tables`
    (twice--once for trail information and once for trail statistics), and
    `get_names_coords`. Once all of the information is gathered into dictionaries,
    they are combined and added into `trails.db`.
    """
    start_url = f"https://www.trailforks.com/region/{state_name}/trails//?activitytype=6&region=3106"
    url_list = [start_url]
    for page_num in range(state_page_dict[state_name]):
        url_list.append(start_url + f"&page={page_num+2}")
    for page_num in range(state_page_dict[state_name]):
        href_list = []
        driver.get(url_list[page_num])
        green_links = driver.find_elements("xpath", "//tr//a[contains(@class, 'green')]")
        for green in green_links:
            href = green.get_attribute("href") #Grab URLs--otherwise this doesn't work
            href_list.append(href)
        for url in href_list:
            trail_keywords = ["Activities", "Riding Area", "Difficulty Rating", "Local Popularity"]
            stats_keywords = ["Altitude start", "Altitude end", "Grade"]
            basic_vars_dict = {"Distance": ["NA"], "Avg time": ["NA"], "Climb": ["NA"], "Descent": ["NA"]}
            trail_details_vars = {"Activities": ["NA"], "Riding Area": ["NA"], "Difficulty Rating": ["NA"], "Local Popularity": ["NA"]}
            trail_stats_vars = {"Altitude start": ["NA"], "Altitude end": ["NA"], "Grade": ["NA"]}
            name_coord_dict = {'Name': ["NA"], 'Coords': ["NA"]}
            print(url)
            driver.get(url)
            basic_vars = scrape_basic_stats(basic_vars_dict)
            trail_stats = scrape_tables(stats_keywords, 'trailstats_display', trail_stats_vars)
            trail_details = scrape_tables(trail_keywords, 'traildetails_display', trail_details_vars)
            names_coords = get_names_coords(name_coord_dict)
            database_info.add_trails(pd.DataFrame({**names_coords, **basic_vars, **trail_details, **trail_stats}), state_name)
    driver.quit()
There's a lot going on here! I'll attach screenshots from TrailForks to make it easier to follow along. The first thing we do is establish a starting url, which we build with an f-string that contains the name of the state we want. We then append that url to a url list. This list is important because we couldn't get this to work without it, since going back and forward is a bit harder in Selenium than it is in scrapy.
Then, we store the urls for all of the trail list pages that correspond to a given state, which, on the website, look like this:
The numbers come from a dictionary saved in database_info.py. Before making scraper.py, I scraped the page numbers.
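To make the pagination concrete, here is the url-list construction in isolation, with a made-up state_page_dict entry (the real dictionary, with scraped page counts, lives in database_info.py):

```python
# hypothetical page count for illustration; the real values are scraped beforehand
state_page_dict = {"california": 3}
state_name = "california"

start_url = f"https://www.trailforks.com/region/{state_name}/trails//?activitytype=6&region=3106"
url_list = [start_url]
# TrailForks paginates with &page=2, &page=3, ...; the bare start_url is page 1
for page_num in range(state_page_dict[state_name]):
    url_list.append(start_url + f"&page={page_num+2}")

print(len(url_list))                    # 4: the first page plus pages 2-4
print(url_list[1].endswith("&page=2"))  # True
```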
Then, on each page, we grab the urls to individual trails. Note that Selenium uses a class called driver to interface with the page that it’s scraping, and the key methods here are called get (which just navigates to the page) and find_elements, which has two arguments. The first argument specifies how the scraper should interpret the second argument, which is some sort of instruction. For example, when we grab all of the links on the page to the trails, we use
driver.find_elements("xpath", "//tr//a[contains(@class, 'green')]")
where the second input is an xpath instruction that find_elements can read because the first argument is simply the string "xpath". There's a lot of documentation for xpath available online, which I first investigated for the movie database homework assignment. This website came in handy: https://www.w3schools.com/xml/xpath_syntax.asp
In the above example, we grab all links (a) contained in all table rows (tr) given that they are of class green (not sure why they're called that, but links to trails are of that class). Once we actually load it into Python, though, we have to call
for green in green_links:
    href = green.get_attribute("href") #Grab URLs--otherwise this doesn't work
    href_list.append(href)
in a for loop to specifically take out the link portion of each a and then add it to the URL list.
## Individual Scraping Functions
Now we can get to the fun part. We essentially give a set of instructions to go to each trail’s page armed with a set of both lists containing keywords as well as dictionaries. The idea here is to search for specific variables stored in each trail and then update the value inside of the corresponding dictionary. As you can see, we split this task into four parts:
basic_vars = scrape_basic_stats(basic_vars_dict)
trail_stats = scrape_tables(stats_keywords, 'trailstats_display', trail_stats_vars)
trail_details = scrape_tables(trail_keywords, 'traildetails_display', trail_details_vars)
names_coords = get_names_coords(name_coord_dict)
Let's go over each one briefly.
### scrape_basic_stats
scrape_basic_stats takes basic_vars_dict as an input:
def scrape_basic_stats(basic_vars_dict):
    """
    This function should be called within the `state_scraper` function.
    Inputs:
    ``basic_vars_dict``: a dictionary with the relevant variable names as keys and, by default,
    `NA` as values.
    On a given trail's website, it will find a `div` element of id `basicTrailStats`.
    This is a table containing information about the trail's length, average time to
    completion, ascent, and descent, if those variables are available.
    The function looks for `div`s of ID `padded10` which contain `col-3` class `div`s.
    The name of each variable is the `text` of the `small.grey` class within each `col-3`.
    The variable's value is stored in the `text` of `div` `large` or `large hovertip`.
    The function uses conditional statements as not all of the variables are available.
    """
    try:
        basic_stats_div = driver.find_element(By.ID, "basicTrailStats")
        padded10_divs = basic_stats_div.find_elements(By.CLASS_NAME, "padded10")
        for padded10_div in padded10_divs:
            col_3_divs = padded10_div.find_elements(By.CLASS_NAME, "col-3")
            for col_3_div in col_3_divs:
                small_grey_div = col_3_div.find_element(By.CLASS_NAME, "small.grey")
                variable_name = small_grey_div.text
                large_div = col_3_div.find_element(By.CSS_SELECTOR, ".large, .large.hovertip")
                variable_value = large_div.text
                if variable_name in basic_vars_dict:
                    basic_vars_dict[variable_name] = variable_value
    except:
        pass
    return basic_vars_dict
This function essentially looks for the light gray box at the top of each trail's page which has some basic information like distance, climb, descent, and average completion time. This box has an ID of basicTrailStats (see how we use find_element with By.ID?) and contains divs of class padded10, which in turn contain divs of class col-3. These essentially work like dictionaries. Each one has a div of class large and one of class small grey (note that we use periods in the selectors) whose text contains the information we want.
We have to make sure that each variable we want is in our dictionary and, furthermore, that it's actually present on the trail page. Some trails have sparse or non-existent information for certain variables, so the corresponding elements simply don't exist. This is also why this helper function (and every other one) has its body in a try/except block, since it's possible, on rare occasions, for basicTrailStats to not exist at all.
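The "NA" defaults are what make this robust: whatever the page does provide overwrites its default, and everything else stays "NA". Stripped of the Selenium calls, the update pattern looks like this (the found_on_page dictionary is a hypothetical stand-in for the scraped divs):

```python
# defaults survive for anything the page doesn't provide
basic_vars_dict = {"Distance": ["NA"], "Avg time": ["NA"], "Climb": ["NA"], "Descent": ["NA"]}

# pretend the page only listed two of the four variables
found_on_page = {"Distance": "4.2 miles", "Climb": "1,200 ft"}

for variable_name, variable_value in found_on_page.items():
    if variable_name in basic_vars_dict:
        basic_vars_dict[variable_name] = variable_value

print(basic_vars_dict["Avg time"])   # ['NA'] -- untouched default
print(basic_vars_dict["Distance"])   # '4.2 miles'
```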
### scrape_tables
def scrape_tables(var_list, element_id, var_dict):
"""
This function is used within `state_scraper` to access `ul`s with variable information.
Inputs:
``var_list``: a list ofthe relevant variable names.
``element_id``: the `id` of the `ul` in question--there are two which need to be scraped.
``var_dict``: a dictionary with the relevant variable names as keys and, by default,
`NA` as values.
Each `ul` contains `li`s formatted like a dictionary with a `div` of class `term`,
which stores variable names, and a `div` of class `definition` which stores variable
values. Both `div`s store the key information in their `text`. Since not all variables
are needed, we check if they are in ``var_list`` before assigning them to
``var_dict``'s values.
"""
try:
li_elements = driver.find_elements("xpath", f"//ul[contains(@id, '{element_id}')]//li")
for li_element in li_elements:
term_div = li_element.find_elements(By.CLASS_NAME, "term")
definition_div = li_element.find_elements(By.CLASS_NAME, "definition")
for idx, terms in enumerate(term_div):
if terms.text in var_list:
var_dict[terms.text] = [definition_div[idx].text]
except:
pass
    return var_dict

This function is called twice in the body of the scraper because there are two similarly structured tables (one with general trail information and one with more detailed statistics) that we want to grab certain variables from. The tables are unordered lists (uls) containing list elements (lis) that we can conveniently loop through, since find_elements returns a Python iterable (similarly to scrapy). The lists are once again organized like dictionaries: in fact, they contain classes called term and definition, which we parse through one by one and use to update our dictionary, which is then returned.
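The term/definition pairing can be illustrated with plain lists standing in for the scraped divs (a sketch under made-up values, not the scraper itself):

```python
# Two parallel lists stand in for the `term` and `definition` divs.
terms = ["Difficulty Rating", "Dogs Allowed", "Condition"]
definitions = ["Blue", "Yes", "Dry"]
var_list = ["Difficulty Rating", "Dogs Allowed"]
var_dict = {v: ["NA"] for v in var_list}

# Same indexing trick as scrape_tables: the idx-th term pairs with the idx-th definition.
for idx, term in enumerate(terms):
    if term in var_list:
        var_dict[term] = [definitions[idx]]

print(var_dict)  # {'Difficulty Rating': ['Blue'], 'Dogs Allowed': ['Yes']}
```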
### get_names_coords()
def get_names_coords(name_coord_dict):
"""
This function is used within `state_scraper` to access each trail's name and coordinates.
Inputs:
``name_coord_dict``: a dictionary with keys `Name` and `Coords`. By default, the values are `NA`
The function grabs the trail's name from a `span` of class `translate` from the top of the page.
It also grabs the coordinates from a `span` of class `grey2` within a `div` of class
`margin-bottom-15`. The coordinates are stored in the `span`'s `text`.
"""
try:
name_raw = driver.find_element("xpath", "//span[contains(@class, 'translate')][1]")
name_coord_dict['Name'] = [name_raw.text]
except:
name_coord_dict['Name'] = "NA"
try:
coord_raw = driver.find_element("xpath", "//div[contains(@class, 'margin-bottom-15 grey')]/span[contains(@class, 'grey2')][2]") #Get coords
name_coord_dict['Coords'] = coord_raw.text
except:
name_coord_dict['Coords'] = "NA"
    return name_coord_dict

This is a bit more of a miscellaneous function, since it isn't dedicated to any one purpose. All it does is store the name of the trail as well as its coordinates, which live in two different classes of spans found on different parts of the page. Then we just grab their text. Simple as that!
Once we grab all of that information, we add it to our SQL database:
database_info.add_trails(pd.DataFrame({**names_coords, **basic_vars, **trail_details, **trail_stats}), state_name)

which I'll talk about in the next section.
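The `{**a, **b}` unpacking in that call merges the four helper dictionaries into one before building the DataFrame. A minimal illustration with made-up values (only two dictionaries here, but the idea is identical):

```python
import pandas as pd

# Each helper returns a dict of column-name -> single-element list.
names_coords = {"Name": ["Mist Trail"], "Coords": ["(37.73, -119.56)"]}
basic_vars = {"Distance": ["7 miles"], "Climb": ["2,000 ft"]}

# {**a, **b} merges the dicts into one; later dicts win on duplicate keys.
row = pd.DataFrame({**names_coords, **basic_vars})
print(list(row.columns))  # ['Name', 'Coords', 'Distance', 'Climb']
print(len(row))           # 1
```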
scraper_parks.py
This file also contains a function called state_scraper which once again takes a state’s name as its input, but this time, it’s focused on collecting information about parks rather than trails (on TrailForks, there are many trails within one park). I was able to run this to get some numerical data about parks throughout the US (including national parks) which was then integrated into the website and recommender system.
This time, I don’t use helper functions (rather, I just keep it in the state_scraper’s body), so we’ll go through it bit by bit. Firstly, it’s worth going over what we actually want from each park’s page:
- Each park’s name and location
- The number of trails in each park, how long the trails are, and the popularity ranking.
- How many trails there are of each difficulty
## Set-up
The settings for Selenium are the same as in the last case. The pre-scraping part is almost identical as well:
start_url = f"https://www.trailforks.com/region/{state_name}/ridingareas/?activitytype=6"
url_list = [start_url]
for page_num in range(database_info.state_dictionary[state_name]):
url_list.append(start_url + f"&page={page_num+2}")
for page_num in range(len(url_list)): #Iterate over every page URL we just built
href_list = []
driver.get(url_list[page_num])
green_links = driver.find_elements("xpath","//tr//a[contains(@class, 'green')]")
for green in green_links:
href = green.get_attribute("href") #Grab URLs--otherwise this doesn't work
        href_list.append(href+"/?activitytype=6")

Once again, I pre-scraped the number of pages required per state, though because there are fewer parks than trails, it wasn't as big of a load on my computer.
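The URL-building step amounts to appending &page=2, &page=3, and so on up to the pre-scraped page count. Sketched with a hypothetical state and a page count of 2:

```python
# Hypothetical state with 2 extra pages beyond the first.
state_name = "wyoming"
page_count = 2
start_url = f"https://www.trailforks.com/region/{state_name}/ridingareas/?activitytype=6"

url_list = [start_url]
for page_num in range(page_count):
    url_list.append(start_url + f"&page={page_num + 2}")  # page numbering starts at 2

print(url_list[1])
# https://www.trailforks.com/region/wyoming/ridingareas/?activitytype=6&page=2
```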
Names and coordinates
Here’s the first part of where we actually scrape:
for url in href_list:
no_name_found = False
info_dict = {"Name":["NA"], "Location":["NA"], "Coords":["NA"]}
stats_dict = {"Trails (view details)":["NA"],"Total Distance":["NA"], "State Ranking":["NA"],}
trail_difficulty_count = {"Access Road/Trail":0,"White":0,"Green":0,"Blue":0,"Black":0,"Double Black Diamond":0, "Proline":0}
print(url)
driver.get(url)
area_name_raw = driver.find_element("xpath", "//span[contains(@class, 'translate')][1]")
info_dict["Name"] = area_name_raw.text
try:
city_name_raw = driver.find_element(By.CLASS_NAME, "small.grey2.mobile_hide")
info_dict["Location"] = city_name_raw.text
except:
no_name_found = True
We create the three dictionaries we want for each URL. The print(url) call is present as a debugging tool since, unfortunately, this scraper crashed multiple times due to bugs (mostly from elements not being present), which I eventually patched out.
We get the name of the park by finding a span with class translate (not sure why it's stored like that; it's actually within an h1 within a ul called page_title_container). Then, we try to look for the name of the city it's in by grabbing a small piece of text next to the park's name. Sometimes this isn't present, which is why we have a bool called no_name_found. There's a way around this, though, which we'll show later…
### Ranking, Distance, and Trail Numbers
stats_items = ["State Ranking", "Total Distance", "Trails (view details)"]
dict_category = driver.find_elements("xpath", "//dl//dt")
dict_information = driver.find_elements("xpath", "//dl//dd")
for idx, terms in enumerate(dict_category):
if terms.text in stats_items:
stats_dict[terms.text] = [dict_information[idx].text]
try:
difficulty_ul = driver.find_element(By.CLASS_NAME, 'stats.flex.nostyle.inline.clearfix')
for li in difficulty_ul.find_elements(By.TAG_NAME, 'li'):
difficulty_span = li.find_element(By.XPATH, './/span[contains(@class, "stat-label clickable")]/span')
difficulty_name = difficulty_span.get_attribute('title')
if difficulty_name in trail_difficulty_count.keys():
num_trails_span = li.find_element(By.CLASS_NAME, 'stat-num')
num_trails = int(num_trails_span.text)
                trail_difficulty_count[difficulty_name] = num_trails

The code here is somewhat dense because all of this information is stored in a dictionary-like element called a dl which, in turn, holds something like a key in each dt and something like a value in each dd. Essentially, we update the ranking and trail distances by inspecting these.
It's a little bit harder to get the number of trails per difficulty. Basically, there's an unordered list with a long class name (stats.flex.nostyle.inline.clearfix) that sorts the number of trails by difficulty. Each li has the number of trails stored within it, but it also has a graphic representing the difficulty (a small picture), and it's the graphic that actually holds the name of the difficulty, which is why we have to extract difficulty_name from a span of class stat-label clickable. Then, we simply grab the text that displays how many trails of a given difficulty there are, convert it to an integer, and add it to our dictionary.
Coordinates
One of the unfortunate parts of the parks list is that the coordinates of each park are not present! To get around this, we tell the scraper to go to the first trail in each park and grab its coordinates (remember scraper.py?) and then store it.
try:
green_link = driver.find_element("xpath","//tr//a[contains(@class, 'green')]")
park_link = green_link.get_attribute("href")
driver.get(park_link)
except:
pass
try:
coord_raw = driver.find_element("xpath", "//div[contains(@class, 'margin-bottom-15 grey')]/span[contains(@class, 'grey2')][2]") #Get coords
info_dict['Coords'] = [coord_raw.text]
if no_name_found:
city_name_raw = driver.find_element(By.CLASS_NAME, "weather_date bold green")
info_dict["Location"] = city_name_raw.text
except:
        info_dict['Coords'] = ["NA"]

It's here that we also resolve the issue of a missing city name. On each trail's page, there's a short infobox containing weather information for the nearest city, which is guaranteed to appear, so we can get an approximate location name by grabbing the city name from this box.
Once we’re done with that, it’s off to the database again!
SQL Database - Zion
There's a lot of information that we scrape from TrailForks which has to be managed within a SQL database for easy access. For example, California has more than 16,000 trails, and for each trail we collected 12 variables (see above), which means more than 192,000 entries! We used sqlite3 to manage a database, or rather, two: one called trails.db containing individual trails (specifically, those in California, though our original plan was to include the entire country), and one called trails_new.db (now that I think about it, I probably should've given it a different name) containing park information, where each state has a different table.
## database_info.py
Everything relevant to managing the databases is stored in a separate Python file called database_info.py. Here I can show you the structure of both databases:
### Making the databases
def make_db(state):
conn = sqlite3.connect("trails.db")
cmd = f"""
CREATE TABLE IF NOT EXISTS {state_name_code_name_dict[state]}(
name VARCHAR(255),
coords VARCHAR(255),
Distance VARCHAR(255),
'Avg time' VARCHAR(255),
Climb VARCHAR(255),
Descent VARCHAR(255),
Activities VARCHAR(255),
'Riding Area' VARCHAR(255),
'Difficulty Rating' VARCHAR(255),
'Dogs Allowed' VARCHAR(255),
'Local Popularity' VARCHAR(255),
'Altitude start' VARCHAR(255),
'Altitude end' VARCHAR(255),
Grade VARCHAR(255)
);
"""
cursor = conn.cursor()
cursor.execute(cmd)
cursor.close()
conn.close()
def make_db_parks(state):
conn = sqlite3.connect("trails_new.db")
cmd = f"""
CREATE TABLE IF NOT EXISTS {state_name_code_name_dict[state]}(
Name VARCHAR(255),
Location VARCHAR(255),
Coords VARCHAR(255),
'Trails (view details)' SMALLINT(255),
'Total Distance' VARCHAR(255),
'State Ranking' VARCHAR(255),
'Access Road/Trail' SMALLINT(255),
White SMALLINT(255),
Green SMALLINT(255),
Blue SMALLINT(255),
Black SMALLINT(255),
'Double Black Diamond' SMALLINT(255),
Proline SMALLINT(255)
);
"""
cursor = conn.cursor()
cursor.execute(cmd)
cursor.close()
    conn.close()

These two functions were run once in order to actually create the databases for the first time. They contain the variables mentioned previously, mostly stored as text.
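The multi-word column names like 'Avg time' only work because they're quoted in the CREATE TABLE statement. The same idea can be tried against a throwaway in-memory database (table and values here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway DB instead of trails.db
cur = conn.cursor()

# Quoted identifiers let column names contain spaces, as in make_db.
cur.execute("""
CREATE TABLE IF NOT EXISTS California(
    name VARCHAR(255),
    'Avg time' VARCHAR(255),
    'Difficulty Rating' VARCHAR(255)
);
""")
cur.execute("INSERT INTO California VALUES (?, ?, ?)", ("Mist Trail", "4h", "Blue"))
avg_time = cur.execute('SELECT "Avg time" FROM California').fetchone()
print(avg_time)  # ('4h',)
conn.close()
```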
Adding information
If you recall from the scraping functions, there was a function call that would add information from each park to the SQL database. Here’s the source code for those functions:
def get_db():
conn = sqlite3.connect("trails.db")
return conn
def add_trails(df,state):
conn = get_db()
df.to_sql(state, conn, if_exists = "append", index = False)
def get_db_new():
conn = sqlite3.connect("trails_new.db")
return conn
def add_trails_new(df,state):
conn = get_db_new()
    df.to_sql(state, conn, if_exists = "append", index = False)

The functions get_db and get_db_new (most things relating to scraper_parks are labeled new since we did this second) establish connections to their respective databases. add_trails and add_trails_new are therefore responsible for actually adding entries to each database. Note that each takes a df (which contains the scraped info) and a state name, which sends the information to the correct table.
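to_sql with if_exists="append" is what lets each scraped batch accumulate in the same table rather than overwriting it. A self-contained demo on an in-memory database (the batches are invented):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# Two "batches", as if two scraping runs wrote to the same state table.
batch1 = pd.DataFrame({"name": ["Trail A"], "coords": ["(37.7, -119.5)"]})
batch2 = pd.DataFrame({"name": ["Trail B"], "coords": ["(38.1, -120.0)"]})
batch1.to_sql("California", conn, if_exists="append", index=False)
batch2.to_sql("California", conn, if_exists="append", index=False)

# Both rows survive because "append" adds to the existing table.
count = conn.execute("SELECT COUNT(*) FROM California").fetchone()[0]
print(count)  # 2
conn.close()
```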
Miscellaneous Tables
There are several dictionaries and lists that we generated in order to make the functions easier to run:
states = ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "idaho-3166", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "new-hampshire", "new-jersey", "new-mexico", "new-york", "north-carolina", "north-dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "rhode-island", "south-carolina", "south-dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "west-virginia", "Wisconsin", "Wyoming"]
state_name_code_name_dict = {
'Alabama': 'Alabama',
'Alaska': 'Alaska',
'Arizona': 'Arizona',
'Arkansas': 'Arkansas',
'California': 'California',
'Colorado': 'Colorado',
'Connecticut': 'Connecticut',
'Delaware': 'Delaware',
'Florida': 'Florida',
'Georgia': 'Georgia',
'Hawaii': 'Hawaii',
'idaho-3166': 'Idaho',
'Illinois': 'Illinois',
'Indiana': 'Indiana',
'Iowa': 'Iowa',
'Kansas': 'Kansas',
'Kentucky': 'Kentucky',
'Louisiana': 'Louisiana',
'Maine': 'Maine',
'Maryland': 'Maryland',
'Massachusetts': 'Massachusetts',
'Michigan': 'Michigan',
'Minnesota': 'Minnesota',
'Mississippi': 'Mississippi',
'Missouri': 'Missouri',
'Montana': 'Montana',
'Nebraska': 'Nebraska',
'Nevada': 'Nevada',
'new-hampshire': 'NewHampshire',
'new-jersey': 'NewJersey',
'new-mexico': 'NewMexico',
'new-york': 'NewYork',
'north-carolina': 'NorthCarolina',
'north-dakota': 'NorthDakota',
'Ohio': 'Ohio',
'Oklahoma': 'Oklahoma',
'Oregon': 'Oregon',
'Pennsylvania': 'Pennsylvania',
'rhode-island': 'RhodeIsland',
'south-carolina': 'SouthCarolina',
'south-dakota': 'SouthDakota',
'Tennessee': 'Tennessee',
'Texas': 'Texas',
'Utah': 'Utah',
'Vermont': 'Vermont',
'Virginia': 'Virginia',
'Washington': 'Washington',
'west-virginia': 'WestVirginia',
'Wisconsin': 'Wisconsin',
'Wyoming': 'Wyoming'
}
state_dictionary = {'Alabama': 11, 'Alaska': 11, 'Arizona': 49, 'Arkansas': 16, 'California': 152, 'Colorado': 69, 'Connecticut': 56, 'Delaware': 4, 'Florida': 18, 'Georgia': 17, 'Hawaii': 5, 'idaho-3166': 31, 'Illinois': 51, 'Indiana': 10, 'Iowa': 8, 'Kansas': 3, 'Kentucky': 9, 'Louisiana': 2, 'Maine': 27, 'Maryland': 16, 'Massachusetts': 146, 'Michigan': 55, 'Minnesota': 36, 'Mississippi': 3, 'Missouri': 11, 'Montana': 41, 'Nebraska': 3, 'Nevada': 16, 'new-hampshire': 41, 'new-jersey': 40, 'new-mexico': 25, 'new-york': 60, 'north-carolina': 26, 'north-dakota': 7, 'Ohio': 29, 'Oklahoma': 4, 'Oregon': 38, 'Pennsylvania': 54, 'rhode-island': 9, 'south-carolina': 6, 'south-dakota': 7, 'Tennessee': 16, 'Texas': 50, 'Utah': 62, 'Vermont': 25, 'Virginia': 27, 'Washington': 92, 'west-virginia': 18, 'Wisconsin': 25, 'Wyoming': 19}
state_parks_dictionary = {'Alabama': 1, 'Alaska': 1, 'Arizona': 3, 'Arkansas': 2, 'California': 8, 'Colorado': 4, 'Connecticut': 7, 'Delaware': 1, 'Florida': 2, 'Georgia': 2, 'Hawaii': 1, 'idaho-3166': 2, 'Illinois': 10, 'Indiana': 1, 'Iowa': 1, 'Kansas': 1, 'Kentucky': 1, 'Louisiana': 1, 'Maine': 3, 'Maryland': 1, 'Massachusetts': 7, 'Michigan': 5, 'Minnesota': 3, 'Mississippi': 1, 'Missouri': 2, 'Montana': 2, 'Nebraska': 1, 'Nevada': 1, 'new-hampshire': 3, 'new-jersey': 3, 'new-mexico': 2, 'new-york': 5, 'north-carolina': 3, 'north-dakota': 1, 'Ohio': 4, 'Oklahoma': 1, 'Oregon': 3, 'Pennsylvania': 3, 'rhode-island': 1, 'south-carolina': 1, 'south-dakota': 1, 'Tennessee': 2, 'Texas': 4, 'Utah': 3, 'Vermont': 2, 'Virginia': 2, 'Washington': 6, 'west-virginia': 2, 'Wisconsin': 3, 'Wyoming': 1}

state_dictionary and state_parks_dictionary store the number of pages required for each state. states simply contains the names of all the states in alphabetical order, and state_name_code_name_dict maps the way a state is displayed in TrailForks URLs to the name we use for its table.
Connecting National Parks to Individual Trail/Park Info
Now we need to make sure to connect the data that we’ve collected here with the actual table generated by the recommender to give the user more information. Let’s take a look at our output from the similarity score model:
output

| | national_park | state | trail | activity | overall_rating | comment_title | comment_ratings | comment_text | Latitude | Longitude | Area | Visitors (2018) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 303 | Badlands National Park | South Dakota (SD) | Pinnacles Overlook | Points of Interest & Landmarks | 5.0 | Must See Pullover | 5.0 of 5 bubbles | This is one of a handful of overlooks you have... | 43.75 | -102.50 | 242,755.94 acres (982.4 km2) | 1008942 |
| 235 | Arches National Park | Utah (UT) | Delicate Arch | Points of Interest & Landmarks | 5.0 | Delicate Arch | 5.0 of 5 bubbles | Our family chose to hike to Delicate Arch late... | 38.68 | -109.57 | 76,678.98 acres (310.3 km2) | 1663557 |
| 863 | Capitol Reef National Park | Utah (UT) | Capitol Reef National Park | National Parks | 4.5 | Add Capitol Reef to Your Utah National Park List | 5.0 of 5 bubbles | Just to the northeast of more popular parks Br... | 38.20 | -111.17 | 241,904.50 acres (979.0 km2) | 1227627 |
| 1310 | Death Valley National Park | California (CA) | Zabriskie Point | Geologic Formations | 4.5 | The Most Iconic Place in Death Valley | 4.0 of 5 bubbles | You can't miss it. I don't mean you have to do... | 36.24 | -116.82 | 3,373,063.14 acres (13,650.3 km2) | 1678660 |
| 1611 | Grand Teton National Park | Wyoming (WY) | Taggart Lake | Hiking Trails | 5.0 | Do this hike if you want to feel like you're a... | 5.0 of 5 bubbles | It's not a difficult hike and is right off the... | 43.73 | -110.80 | 310,044.22 acres (1,254.7 km2) | 3491151 |
| 222 | Arches National Park | Utah (UT) | Double Arch | Hiking Trails | 5.0 | Easy hike | 5.0 of 5 bubbles | The Double Arch is unreal. It is massive and b... | 38.68 | -109.57 | 76,678.98 acres (310.3 km2) | 1663557 |
| 3198 | Mount Rainier National Park | Washington (WA) | Sunrise Visitor Center | Visitor Centers | 4.5 | Amazing views | 5.0 of 5 bubbles | Amazing hikes of all varieties. Many travel up... | 46.85 | -121.75 | 236,381.64 acres (956.6 km2) | 1518491 |
| 1439 | Glacier National Park | Montana (MT) | Grinnell Glacier | Hiking Trails | 5.0 | Incredible vies and the end-point is rewarding | 5.0 of 5 bubbles | This 13 mile hike from Many Glacier to upper G... | 48.80 | -114.00 | 1,013,125.99 acres (4,100.0 km2) | 2965309 |
| 1366 | Glacier National Park | Montana (MT) | Virginia Falls | Waterfalls | 5.0 | Magnificent Falls in Glacier National Park - w... | 5.0 of 5 bubbles | This is the second falls on a hike in Glacier ... | 48.80 | -114.00 | 1,013,125.99 acres (4,100.0 km2) | 2965309 |
| 650 | Canyonlands National Park | Utah (UT) | Horseshoe Canyon | Canyons | 5.0 | WHOA! READ PLEASE. Things you NEED to know a... | 5.0 of 5 bubbles | There are some older reviews. Some are VERY M... | 38.20 | -109.93 | 337,597.83 acres (1,366.2 km2) | 739449 |
Because we have two different SQL databases, one for nation-wide park data (trails_new.db) and one with state-wide trail data (trails.db), let’s split this into two different frames.
california_df = output[output['state'] == 'California (CA)']
non_california_df = output[output['state'] != 'California (CA)']

Now we'll get our databases in our notebook:
!wget https://raw.githubusercontent.com/torwar02/trails/main/trails/trails.db -O trails.db
!wget https://raw.githubusercontent.com/torwar02/trails/main/trails/trails_new.db -O trails_new.db

--2024-03-22 20:55:28-- https://raw.githubusercontent.com/torwar02/trails/main/trails/trails.db
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3534848 (3.4M) [application/octet-stream]
Saving to: ‘trails.db’
trails.db 100%[===================>] 3.37M --.-KB/s in 0.07s
2024-03-22 20:55:28 (48.8 MB/s) - ‘trails.db’ saved [3534848/3534848]
--2024-03-22 20:55:28-- https://raw.githubusercontent.com/torwar02/trails/main/trails/trails_new.db
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1339392 (1.3M) [application/octet-stream]
Saving to: ‘trails_new.db’
trails_new.db 100%[===================>] 1.28M --.-KB/s in 0.06s
2024-03-22 20:55:28 (23.1 MB/s) - ‘trails_new.db’ saved [1339392/1339392]
There’s a bit of an issue, though. Let’s look at our table names:
import sqlite3
db_path = 'trails_new.db'
conn = sqlite3.connect(db_path) #Establish connection with DB
cur = conn.cursor()
cur.execute("SELECT name FROM sqlite_master WHERE type='table';") #This specifically grabs all table names from our database.
tables = cur.fetchall()
table_names = [table[0] for table in tables] #Places them into a list
print("List of tables in the database:", table_names)
conn.close()

List of tables in the database: ['Maine', 'California', 'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'idaho-3166', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'new-hampshire', 'new-jersey', 'new-mexico', 'new-york', 'north-carolina', 'north-dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'rhode-island', 'south-carolina', 'south-dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'west-virginia', 'Wisconsin', 'Wyoming']
Our tables aren't completely in alphabetical order (I was testing with Maine first, for instance), and some aren't formatted the same way: south-dakota, for instance, is lowercase and hyphenated. But if we compare this to what we have in output:
set(output['state'])

{'California (CA)',
'Montana (MT)',
'South Dakota (SD)',
'Utah (UT)',
'Washington (WA)',
'Wyoming (WY)'}
Here we have nice, capitalized state names with two-letter abbreviations. So, then, how are we going to fix this? We’re going to create a dictionary that essentially works as a mapping that takes what we have in output and matches it to what exists in table_names based on some matching criteria:
# Extract unique states and sort them
unique_states_in_output = sorted(set(output['state']), key=str.lower)
table_names = sorted(table_names, key=str.lower)
def compare_letters(state_name, table_name):
clean_state_name = ''.join(filter(str.isalpha, state_name)).lower() #Eliminate non-alphabetical characters, condense together
clean_table_name = ''.join(filter(str.isalpha, table_name)).lower()
return sorted(clean_state_name) == sorted(clean_table_name) #Gives a boolean value.
state_name_to_table_name = {} #Create new dictionary
for state_with_abbreviation in unique_states_in_output:
state_name = state_with_abbreviation.split(' (')[0] # Get rid of the parentheses in the abbreviation (like 'South Dakota (SD)')
match = next((table for table in table_names if compare_letters(state_name, table)), None) #Generator based on whether or not names are the same
if match:
state_name_to_table_name[state_with_abbreviation] = match #Update dict if match found
print(state_name_to_table_name)

{'California (CA)': 'California', 'Montana (MT)': 'Montana', 'South Dakota (SD)': 'south-dakota', 'Utah (UT)': 'Utah', 'Washington (WA)': 'Washington', 'Wyoming (WY)': 'Wyoming'}
Now that’s what we’re looking for! We do a few important things here:
Firstly, we make sure to sort both the states we have in output and the tables in table_names alphabetically. The reason we use key=str.lower is that some table names are capitalized while others are lowercase; this makes the sort case-insensitive.
Then we create a helper function called compare_letters which takes two state names (one from output, one from the database) and checks whether they have the same letters. We do this by filtering out non-alphabetic characters (including spaces and hyphens), lowercasing everything, and comparing the sorted letters. The function returns True or False depending on whether they match.
We actually build state_name_to_table_name in the for loop below. We go through each state in output, extract just the part of the name that comes before the two-letter abbreviation, and create a generator that individually calls compare_letters on each table name. If one returns True, we have a match, which causes the dictionary to be updated. Otherwise, nothing happens and we simply move on to the next entry (that's why the second argument of next is None).
## Logic for linking databases
Our goal now is to go through each recommendation and match up the park or trail information corresponding to it (assuming it's present in the database). One issue that can arise is that the name of a park in output might differ from its name in the database. To get around this, we instead compare the coordinates of what's in output to the rows inside trails_new.db and trails.db. The idea is that if two parks are close enough in terms of their coordinates, they should represent the same thing. So, we're going to write two functions that do similar (but distinct) things: one called fetch_park_info_based_on_coords, which looks at parks (i.e., outside of California), and one called fetch_trail_info_based_on_coords, which looks at individual trails in California.
def fetch_park_info_based_on_coords(db_name, latitude, longitude, margin_lat, margin_long):
conn = sqlite3.connect(db_name)
cursor = conn.cursor() #Connect to database
for table_name in state_name_to_table_name.values(): #This is what we made earlier
cursor.execute(f"SELECT * FROM \"{table_name}\"") #Grab everything from the table
rows = cursor.fetchall()
for row in rows: #For each row
coords_text = row[2] # Coords are in the third column
try:
coords = eval(coords_text) #Kept as a tuple, essentially
lat_diff = abs(coords[0] - latitude)
long_diff = abs(coords[1] - longitude)
if lat_diff < margin_lat and long_diff < margin_long:
return row[3:] # Don't need name and coords
except:
continue
conn.close()
return None
def fetch_trail_info_based_on_coords(db_name, latitude, longitude, margin_lat, margin_long):
conn = sqlite3.connect(db_name)
cursor = conn.cursor()
table_name = 'California' #Only getting CA trails
cursor.execute(f"SELECT * FROM {table_name}") #Grab everything
rows = cursor.fetchall()
for row in rows:
coords_text = row[1] # Coords are in column 2
try:
coords = eval(coords_text)
lat_diff = abs(coords[0] - latitude)
long_diff = abs(coords[1] - longitude)
if lat_diff < margin_lat and long_diff < margin_long:
return row[2:]
except:
continue # Skip rows with invalid 'Coords'
conn.close()
    return None

Okay, it will make a lot more sense if we actually inspect the structure of our database again. Click the link below to see screenshots of two .csv files: the first is of parks in Wyoming, and the second is of trails in California:
https://imgur.com/a/6fCixEt
With that out of the way, let's dive into the code. We go through the mapping dictionary we made previously and grab all of the rows from each table. Then we look at the third column (i.e., row[2]), which corresponds to the coordinates (see screenshot), and record the absolute difference between those coordinates and a given latitude and longitude (we'll be taking those from output, where they're individual columns rather than a tuple). If both differences are within a specified margin of error, we've found our match. Note that we only return everything from the fourth column on: the first three are just the park's name, location, and coordinates.
For fetch_trail_info_based_on_coords, we have a very similar set-up, except that the coordinates are in the second column and we return everything after the first two (the trail's name and coordinates).
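The matching criterion itself is just two absolute differences checked against a margin. Here it is in isolation, using ast.literal_eval as a safer stand-in for the eval call in the functions above (the coordinate pairs are the Zabriskie Point figures discussed in this post plus one exact match):

```python
import ast

def coords_match(coords_text, latitude, longitude, margin_lat, margin_long):
    # Parse the stored string "(lat, long)" into a tuple of floats.
    coords = ast.literal_eval(coords_text)
    lat_diff = abs(coords[0] - latitude)
    long_diff = abs(coords[1] - longitude)
    return lat_diff < margin_lat and long_diff < margin_long

# TrailForks' Zabriskie Point coords vs. the National Park data: 0.18 apart in latitude.
print(coords_match("(36.420820, -116.810120)", 36.24, -116.82, 0.1, 0.1))  # False
print(coords_match("(38.68, -109.57)", 38.68, -109.57, 0.1, 0.1))          # True
```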
Now, let’s move on so we can see how we actually use these functions!
Putting it all together
The first thing we’re going to do is to specify the names of the new columns that we want to put into california_df and non_california_df. I’ve just grabbed these from the database:
new_columns = [
'Trails (view details)', 'Total Distance', 'State Ranking',
'Access Road/Trail', 'White', 'Green', 'Blue', 'Black',
'Double Black Diamond', 'Proline'
]
new_trail_columns = [
'Distance', 'Avg time', 'Climb', 'Descent', 'Activities',
'Riding Area', 'Difficulty Rating', 'Dogs Allowed',
'Local Popularity', 'Altitude start', 'Altitude end', 'Grade'
]

Now, all we need to do is iterate through the rows of non_california_df to match up the entries!
margin_lat = 0.1 # Decently generous
margin_long = 0.1
for index, row in non_california_df.iterrows():
if pd.isna(row['Latitude']) or pd.isna(row['Longitude']): #Some parks have NA coordinates
continue
park_info = fetch_park_info_based_on_coords('trails_new.db', row['Latitude'], row['Longitude'], margin_lat, margin_long)
#Remember, this grabs almost all of the columns if a match is found
if park_info:
        non_california_df.loc[index, new_columns] = park_info #We can mass-add new columns

In the above code, we use the fetch_park_info_based_on_coords function to fill in the new columns for every row whose coordinates match. We insert all of the values at once, taking advantage of pandas' .loc indexer. Now let's do the same thing for the California df:
for index, row in california_df.iterrows():
if pd.isna(row['Latitude']) or pd.isna(row['Longitude']):
continue
park_info = fetch_trail_info_based_on_coords('trails.db', row['Latitude'], row['Longitude'], margin_lat, margin_long)
    if park_info and len(park_info) == len(new_trail_columns):
        california_df.loc[index, new_trail_columns] = park_info

Okay, let's take a look at our results!
non_california_df

| | national_park | state | trail | activity | overall_rating | comment_title | comment_ratings | comment_text | Latitude | Longitude | ... | Trails (view details) | Total Distance | State Ranking | Access Road/Trail | White | Green | Blue | Black | Double Black Diamond | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 303 | Badlands National Park | South Dakota (SD) | Pinnacles Overlook | Points of Interest & Landmarks | 5.0 | Must See Pullover | 5.0 of 5 bubbles | This is one of a handful of overlooks you have... | 43.75 | -102.50 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 235 | Arches National Park | Utah (UT) | Delicate Arch | Points of Interest & Landmarks | 5.0 | Delicate Arch | 5.0 of 5 bubbles | Our family chose to hike to Delicate Arch late... | 38.68 | -109.57 | ... | 40 | 50 miles | #7,493 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 863 | Capitol Reef National Park | Utah (UT) | Capitol Reef National Park | National Parks | 4.5 | Add Capitol Reef to Your Utah National Park List | 5.0 of 5 bubbles | Just to the northeast of more popular parks Br... | 38.20 | -111.17 | ... | 60 | 194 miles | #9,609 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1611 | Grand Teton National Park | Wyoming (WY) | Taggart Lake | Hiking Trails | 5.0 | Do this hike if you want to feel like you're a... | 5.0 of 5 bubbles | It's not a difficult hike and is right off the... | 43.73 | -110.80 | ... | 26 | 53 miles | #4,761 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 222 | Arches National Park | Utah (UT) | Double Arch | Hiking Trails | 5.0 | Easy hike | 5.0 of 5 bubbles | The Double Arch is unreal. It is massive and b... | 38.68 | -109.57 | ... | 40 | 50 miles | #7,493 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3198 | Mount Rainier National Park | Washington (WA) | Sunrise Visitor Center | Visitor Centers | 4.5 | Amazing views | 5.0 of 5 bubbles | Amazing hikes of all varieties. Many travel up... | 46.85 | -121.75 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1439 | Glacier National Park | Montana (MT) | Grinnell Glacier | Hiking Trails | 5.0 | Incredible vies and the end-point is rewarding | 5.0 of 5 bubbles | This 13 mile hike from Many Glacier to upper G... | 48.80 | -114.00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1366 | Glacier National Park | Montana (MT) | Virginia Falls | Waterfalls | 5.0 | Magnificent Falls in Glacier National Park - w... | 5.0 of 5 bubbles | This is the second falls on a hike in Glacier ... | 48.80 | -114.00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 650 | Canyonlands National Park | Utah (UT) | Horseshoe Canyon | Canyons | 5.0 | WHOA! READ PLEASE. Things you NEED to know a... | 5.0 of 5 bubbles | There are some older reviews. Some are VERY M... | 38.20 | -109.93 | ... | 25 | 177 miles | #9,011 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
9 rows × 22 columns
Success! Unfortunately, we have a few NA values; it's hard to guarantee precision in the coordinates. We only had one trail for California:
california_df

| | national_park | state | trail | activity | overall_rating | comment_title | comment_ratings | comment_text | Latitude | Longitude | Area | Visitors (2018) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1310 | Death Valley National Park | California (CA) | Zabriskie Point | Geologic Formations | 4.5 | The Most Iconic Place in Death Valley | 4.0 of 5 bubbles | You can't miss it. I don't mean you have to do... | 36.24 | -116.82 | 3,373,063.14 acres (13,650.3 km2) | 1678660 |
Wait, really? I thought we would've had this for sure in our database…
On closer inspection, we actually do, but the coordinates on TrailForks versus those we got from the National Park data are a bit off. On the TrailForks page for Zabriskie Point, the coordinates are (36.420820, -116.810120), which is just outside our margin of error.
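To make the mismatch concrete, here is the bounding-box check the coordinate lookup effectively performs. The actual margin_lat / margin_long values are defined earlier in the notebook; 0.1 degrees is an assumed stand-in for illustration:

```python
# Hypothetical margins; the real margin_lat / margin_long come from earlier in the notebook.
margin_lat, margin_long = 0.1, 0.1

np_lat, np_long = 36.24, -116.82          # coordinates in our National Park data
tf_lat, tf_long = 36.420820, -116.810120  # coordinates listed on TrailForks

# A match requires both coordinates to fall inside the margin window
within = abs(np_lat - tf_lat) <= margin_lat and abs(np_long - tf_long) <= margin_long
print(within)  # False: the latitude difference (~0.18 degrees) exceeds the margin
```

With these assumed margins, the latitude alone is off by roughly 0.18 degrees, so Zabriskie Point is never matched against its TrailForks entry.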
Image Matting – Jamie
1. Introduction of Image matting and MODNet
MODNet - Portrait Image Matting
Before web development, let's take a look at a fun image matting model: MODNet. With image matting, we can merge our selfies with pictures of any US national park in our background file, or you can upload a background of your own choice.
Image matting, also known as foreground/background separation, is a computer vision technique that aims to accurately extract the foreground object or region of interest from an image, while preserving the fine details and transparency information around the object boundaries. This process generates an alpha matte, which represents the opacity values for each pixel, allowing for seamless composition of the foreground onto a new background.
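The alpha matte drives compositing through the standard "over" formula, C = αF + (1 − α)B, applied per pixel and per channel. A tiny pure-Python illustration of the blend:

```python
def composite_pixel(fg, bg, alpha):
    """Alpha-composite one channel value: C = alpha*F + (1 - alpha)*B, alpha in [0, 1]."""
    return alpha * fg + (1 - alpha) * bg

# A fully opaque pixel keeps the foreground; a half-transparent one blends 50/50.
print(composite_pixel(200, 40, 1.0))  # 200.0
print(composite_pixel(200, 40, 0.5))  # 120.0
```

The matting model's whole job is to estimate that per-pixel alpha; the compositing itself is this one line applied over the image.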
The MODNet (Matting Objective Decomposition Network) model is a deep learning architecture designed specifically for portrait image matting. It was introduced in a 2022 paper by Zhanghan Ke and collaborators. MODNet stands out from other image matting models due to its unique approach and several key features:
Objective Decomposition: MODNet decomposes the matting task into correlated sub-objectives (coarse semantic estimation, boundary detail prediction, and semantic-detail fusion) that are optimized simultaneously. This decomposition helps to improve the accuracy of the alpha matte predictions, especially around complex object boundaries.
Effective Feature Fusion: MODNet incorporates an effective feature fusion mechanism that combines multi-level features from different stages of the network. This fusion strategy helps to capture both low-level details and high-level semantic information, leading to more accurate and coherent alpha matte predictions.
Lightweight Architecture: Despite its impressive performance, MODNet has a relatively lightweight architecture compared to other state-of-the-art image matting models. This makes it more efficient and suitable for deployment on resource-constrained devices or in real-time applications.
Improved Generalization: MODNet demonstrates strong generalization capabilities, meaning it can produce accurate alpha mattes even for objects or scenes that are significantly different from the training data. This is a crucial advantage over many traditional image matting methods that often struggle with generalization.
The key innovation of MODNet lies in its objective decomposition approach, which lets dedicated branches handle coarse semantics and fine boundary details before fusing them, leading to strong performance in capturing intricate object boundaries and transparency information. This architectural design, combined with effective feature fusion and a lightweight structure, has made MODNet a notable advancement in the field of image matting.
There are several other state-of-the-art models for image matting tasks, in addition to the MODNet architecture. Here are some notable ones:
GCA Matting: Proposed in 2020, the Guided Contextual Attention (GCA) model utilizes a two-stream encoder-decoder architecture with a contextual attention module. This module helps the model better capture long-range dependencies and global context information, leading to improved performance on complex scenes.
AlphaMatting: Introduced in 2021, AlphaMatting is a transformer-based model that leverages the self-attention mechanism to effectively capture long-range dependencies in images. It achieves impressive results, particularly in handling highly complicated backgrounds and foreground objects with intricate structures.
SHM Matting: The Spatially-Hierarchical Matting (SHM) model, proposed in 2022, employs a hierarchical architecture that processes the input image at multiple spatial scales. This approach helps the model capture both fine-grained details and global structures, leading to improved accuracy, especially around object boundaries.
BGMatting: Introduced in 2022, BGMatting (Background Matting) is a two-stage model that first predicts a coarse alpha matte and then refines it using a background estimation module. This unique approach helps the model better handle challenging cases with complex backgrounds or semi-transparent objects.
HDMatt: The High-Definition Matting (HDMatt) model, introduced in 2022, is designed to produce high-resolution alpha mattes by leveraging a progressive upsampling strategy. It achieves impressive results, particularly for high-resolution images, while maintaining a relatively lightweight architecture.
These models represent some of the latest advancements in the field of image matting, each with its own unique architectural design and strengths. The choice of model often depends on factors such as the complexity of the scenes, the required level of detail, and the computational resources available.
Reference: https://github.com/ZHKKKe/MODNet
Let's get started!
2. Preparation
In the top menu of this session, select Runtime -> Change runtime type, and set Hardware Accelerator to GPU.
Clone the repository, and download the pre-trained model:
First we import the os module, which provides functions for interacting with the operating system.
import os
# changes the current directory to /content.
# %cd is a Jupyter Notebook magic command used to change directories within the notebook.
%cd /content
# checks if a directory named MODNet exists in the current directory.
# If it doesn't exist, it clones the GitHub repository located at https://github.com/ZHKKKe/MODNet into a directory named MODNet.
if not os.path.exists('MODNet'):
!git clone https://github.com/ZHKKKe/MODNet
# changes the current directory to the MODNet directory created or found in the previous step.
%cd MODNet/
# defines the path where the pre-trained checkpoint file will be saved or checked for
pretrained_ckpt = 'pretrained/modnet_photographic_portrait_matting.ckpt'
# checks if the file specified by pretrained_ckpt exists.
# If it doesn't exist, it proceeds with downloading the file.
if not os.path.exists(pretrained_ckpt):
# downloads the pre-trained checkpoint file from Google Drive using gdown.
# The file is saved in the specified path (pretrained/modnet_photographic_portrait_matting.ckpt).
# The --id flag specifies the ID of the file on Google Drive, and -O specifies the output filename.
!gdown --id 1mcr7ALciuAsHCpLnrtG_eop5-EYhbCmz \
      -O pretrained/modnet_photographic_portrait_matting.ckpt
/content
Cloning into 'MODNet'...
remote: Enumerating objects: 276, done.
remote: Counting objects: 100% (276/276), done.
remote: Compressing objects: 100% (159/159), done.
remote: Total 276 (delta 105), reused 252 (delta 98), pack-reused 0
Receiving objects: 100% (276/276), 60.77 MiB | 37.53 MiB/s, done.
Resolving deltas: 100% (105/105), done.
/content/MODNet
/usr/local/lib/python3.10/dist-packages/gdown/cli.py:138: FutureWarning: Option `--id` was deprecated in version 4.3.1 and will be removed in 5.0. You don't need to pass it anymore to use a file ID.
warnings.warn(
Downloading...
From: https://drive.google.com/uc?id=1mcr7ALciuAsHCpLnrtG_eop5-EYhbCmz
To: /content/MODNet/pretrained/modnet_photographic_portrait_matting.ckpt
100% 26.3M/26.3M [00:00<00:00, 64.5MB/s]
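Downloads from Google Drive occasionally truncate, so it can be worth a quick sanity check that the checkpoint landed intact before running inference. A minimal sketch, assuming the path used above and an arbitrary 1 MB lower bound (the log shows the full file is about 26.3 MB):

```python
import os

def checkpoint_ok(path, min_bytes=1_000_000):
    """True if the checkpoint file exists and is plausibly complete (not truncated)."""
    return os.path.exists(path) and os.path.getsize(path) >= min_bytes

print(checkpoint_ok('pretrained/modnet_photographic_portrait_matting.ckpt'))
```

If this prints False, re-run the gdown step above before moving on.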
Now let's try this out.
3. Upload Images
Upload portrait images to be processed (only PNG and JPG format are supported):
The following code ensures a clean slate by removing and recreating both input and output folders.
Users can then upload images, which are automatically moved into the input folder for processing.
shutil: This module provides a higher-level interface for file operations, such as copying files and removing directories. google.colab.files: This module provides utilities for interacting with files in a Google Colab environment, including uploading and downloading files.
import shutil
from google.colab import files
Next we set up the input folder path where the images will be stored for processing. If the folder already exists, we remove it and all of its contents (shutil.rmtree), then create a new, empty input folder (os.makedirs).
# clean and rebuild the image folders
input_folder = 'demo/image_matting/colab/input'
if os.path.exists(input_folder):
shutil.rmtree(input_folder)
os.makedirs(input_folder)
Similar to the input folder, this block sets up the output folder path for storing processed images. If the output folder already exists, it is removed along with its contents, and a new, empty output folder is created.
output_folder = 'demo/image_matting/colab/output'
if os.path.exists(output_folder):
shutil.rmtree(output_folder)
os.makedirs(output_folder)
This part allows the user to upload images into the Colab environment.
files.upload() prompts the user to select and upload files. It returns a dictionary where the keys are the uploaded file names and the values are the data.
list(files.upload().keys()) extracts the names of the uploaded files. A loop iterates through each uploaded image file: shutil.move(image_name, os.path.join(input_folder, image_name)) moves each uploaded image file from the current directory to the specified input folder. This step organizes the uploaded images into the input folder for further processing.
# upload images (PNG or JPG)
image_names = list(files.upload().keys())
for image_name in image_names:
    shutil.move(image_name, os.path.join(input_folder, image_name))
Saving 170891_00_2x.jpg to 170891_00_2x.jpg
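The remove-and-recreate pattern used for both folders above can be factored into one small helper. A minimal sketch (the folder name here is illustrative, and the demo runs inside a temporary directory so nothing else is touched):

```python
import os
import shutil
import tempfile

def reset_folder(path):
    """Remove the folder (and its contents) if it exists, then recreate it empty."""
    if os.path.exists(path):
        shutil.rmtree(path)
    os.makedirs(path)

# Demo in a temporary directory
base = tempfile.mkdtemp()
target = os.path.join(base, 'input')
reset_folder(target)
open(os.path.join(target, 'stale.jpg'), 'w').close()  # leave a stale file behind
reset_folder(target)       # wipes the stale file
print(os.listdir(target))  # []
```

Starting from an empty folder each run guarantees that leftover images from a previous session are never processed by mistake.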
4. Inference
The following code runs a Python script/module for image matting inference, specifying the input directory containing the images to be processed, the output directory where the processed images will be saved, and the path to the pre-trained model checkpoint file.
Run the following command for alpha matte prediction:
!python -m demo.image_matting.colab.inference \
--input-path demo/image_matting/colab/input \
--output-path demo/image_matting/colab/output \
  --ckpt-path ./pretrained/modnet_photographic_portrait_matting.ckpt
Process image: 170891_00_2x.jpg
Let’s break down what each part of the command does:
!python - This is a shell command that tells the system to run a Python interpreter.
-m demo.image_matting.colab.inference - -m flag is used to run a module as a script. - demo.image_matting.colab.inference specifies the Python module to run. It’s likely that this module contains the code for performing image matting inference.
--input-path demo/image_matting/colab/input - --input-path is a command-line argument for specifying the path to the input images directory. - demo/image_matting/colab/input is the path to the directory where the input images are stored.
--output-path demo/image_matting/colab/output - --output-path is a command-line argument for specifying the path to the output directory where the processed images will be saved. - demo/image_matting/colab/output is the path to the directory where the processed images will be saved.
--ckpt-path ./pretrained/modnet_photographic_portrait_matting.ckpt - --ckpt-path is a command-line argument for specifying the path to the checkpoint file for the pre-trained model used in image matting. - ./pretrained/modnet_photographic_portrait_matting.ckpt is the path to the pre-trained model checkpoint file relative to the current directory.
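For intuition, the three flags above are the kind of interface argparse produces. This is a hypothetical sketch of how such a script might declare them, not the actual code of demo.image_matting.colab.inference, which ships with the MODNet repository:

```python
import argparse

# Sketch only: the real inference script may define more options.
parser = argparse.ArgumentParser(description='MODNet matting inference (sketch)')
parser.add_argument('--input-path', required=True, help='directory of input images')
parser.add_argument('--output-path', required=True, help='directory for predicted mattes')
parser.add_argument('--ckpt-path', required=True, help='pre-trained checkpoint file')

# argparse maps --input-path to args.input_path, and so on
args = parser.parse_args([
    '--input-path', 'demo/image_matting/colab/input',
    '--output-path', 'demo/image_matting/colab/output',
    '--ckpt-path', './pretrained/modnet_photographic_portrait_matting.ckpt',
])
print(args.input_path)
```

Note that argparse converts the hyphens in option names to underscores when building attribute names.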
5. Visualization
Display the results (from left to right: image, foreground, and alpha matte):
import numpy as np
from PIL import Image
The following function is useful for visualizing the process of image matting, where the foreground is extracted from the original image based on the provided matte.
from mergePicture import combined_display
import inspect
# Print the source code of the 'combined_display' function
print(inspect.getsource(combined_display))
def combined_display(image, matte):
# calculate display resolution
w, h = image.width, image.height
rw, rh = 800, int(h * 800 / (3 * w))
# obtain predicted foreground
image = np.asarray(image)
if len(image.shape) == 2:
image = image[:, :, None]
if image.shape[2] == 1:
image = np.repeat(image, 3, axis=2)
elif image.shape[2] == 4:
image = image[:, :, 0:3]
matte = np.repeat(np.asarray(matte)[:, :, None], 3, axis=2) / 255
foreground = image * matte + np.full(image.shape, 255) * (1 - matte)
# combine image, foreground, and alpha into one line
combined = np.concatenate((image, foreground, matte * 255), axis=1)
combined = Image.fromarray(np.uint8(combined)).resize((rw, rh))
# extract the middle image
middle_image = Image.fromarray(np.uint8(foreground))
    return combined, middle_image
The function combined_display takes an image and its corresponding matte (alpha channel) as inputs and returns two images: one for the combined display and one for the middle image (foreground).
Here's what each part of the function does:
- Calculate display resolution: w and h store the width and height of the input image; rw is set to 800, the desired width of the output image, and rh is calculated to preserve the input image's aspect ratio.
- Obtain predicted foreground: convert the input image and matte to NumPy arrays; if the input image is grayscale or has an alpha channel, convert it to a 3-channel image; repeat the matte across the three channels and normalize it; then apply the matte to the input image to compute the predicted foreground.
- Combine image, foreground, and matte: concatenate the input image, predicted foreground, and matte along the horizontal axis, convert the combined array back to an image (Image.fromarray), and resize it to the calculated resolution.
- Extract middle image: convert the predicted foreground array to an image (Image.fromarray) to obtain the middle image.
- Return output: return combined, the image showing the original image, predicted foreground, and matte (alpha channel) concatenated horizontally, and middle_image, the image representing the predicted foreground.
bg_dir = '/content/sample_data/Badlands.jpeg'
# Load the background image of Badlands National Park
background_image = Image.open(bg_dir)
This code segment iterates through all the images in the input folder, visualizes each image with its corresponding matte, and then displays the merged image, where the foreground is composited onto a background based on the matte.
# visualize all images
image_names = os.listdir(input_folder)
for image_name in image_names:
matte_name = image_name.split('.')[0] + '.png'
image = Image.open(os.path.join(input_folder, image_name))
matte = Image.open(os.path.join(output_folder, matte_name))
combined, middle_image = combined_display(image, matte)
# Display combined image
display(combined)
# Display merged
merged = Image.composite(middle_image,background_image, matte)
print(image_name, '\n')
display(merged)
170891_00_2x.jpg

As you can see, the first row of images corresponds to the original image, the extracted foreground, and the matte (alpha).
The second row is the foreground merged with the background.
Let's break down what each part does:
Iterating through image files:
image_names = os.listdir(input_folder)
for image_name in image_names:
This loop iterates through each file name in input_folder, which contains the input images.
Obtaining the matte file name:
matte_name = image_name.split('.')[0] + '.png'
It builds the file name of the matte corresponding to the current image by splitting the image file name at the '.' character and appending '.png'.
Opening image and matte:
image = Image.open(os.path.join(input_folder, image_name))
matte = Image.open(os.path.join(output_folder, matte_name))
It opens the input image and matte files using Image.open() from the PIL library, specifying their respective paths.
Visualizing the combined image:
combined, middle_image = combined_display(image, matte)
display(combined)
It calls the combined_display() function to create the combined image and extract the middle image (predicted foreground), then displays the combined image using display().
Merging the middle image with the background:
merged = Image.composite(middle_image, background_image, matte)
It composites the middle image onto the background image using the alpha channel provided by the matte.
Printing the image name:
print(image_name, '\n')
It prints the name of the current image file.
Displaying the merged image:
display(merged)
It displays the merged image, which combines the middle image with the background using the provided matte.
6. Implementation in Web
With all the functions we made, we want to bring this to our website. Just like this:
How are we going to achieve this?
Web Development - Jamie
In the frontend component of our web application, we built a user interface that lets users upload two distinct images: a background image and a foreground selfie. These inputs are sent to the backend server through an API call.
On the backend, our implementation uses Node.js together with a JavaScript library for subprocess management. With this architecture, we orchestrate the execution of the Python scripts that interface with our deep learning model. The model extracts the foreground subject from the selfie image and composites it onto the provided background image, ensuring a high-fidelity integration of the extracted subject into the selected backdrop.
Here is some background knowledge you might need:
JavaScript is a high-level, interpreted programming language primarily used to create dynamic and interactive content on websites. Initially developed by Netscape as a client-side scripting language for web browsers, JavaScript has evolved into a versatile language that can be used for both client-side and server-side development.
Key features of JavaScript include:
Client-Side Scripting: JavaScript is commonly used to add interactivity to web pages, such as responding to user actions like clicks, mouse movements, form submissions, and more. It can manipulate HTML elements, dynamically change styles, and modify content on the fly.
Cross-Platform: JavaScript is supported by all modern web browsers, making it a cross-platform language. This means that code written in JavaScript will run consistently across different browsers and operating systems.
Object-Oriented: JavaScript is an object-oriented language, allowing developers to create objects with properties and methods to represent real-world entities. Objects can be defined using classes or prototypes, and inheritance is supported through prototype chaining.
Asynchronous Programming: JavaScript supports asynchronous programming using callback functions, promises, and async/await syntax. Asynchronous programming allows tasks to be executed concurrently without blocking the main thread, which is essential for handling I/O operations, such as fetching data from servers or interacting with databases.
Functional Programming: JavaScript also supports functional programming paradigms, such as higher-order functions, closures, and anonymous functions. These features enable developers to write clean, concise, and reusable code.
Server-Side Development: With the advent of server-side JavaScript frameworks like Node.js, JavaScript can now be used to build scalable and high-performance server-side applications. Node.js allows developers to run JavaScript code on the server, enabling full-stack development using a single programming language.
Overall, JavaScript is a versatile language that is widely used for web development, ranging from simple scripts to complex web applications. Its popularity and extensive ecosystem of libraries and frameworks make it an essential tool for modern web development.
Node.js is an open-source, cross-platform JavaScript runtime environment that allows developers to run JavaScript code outside of a web browser. It is built on the V8 JavaScript engine, which is the same engine that powers Google Chrome.
Node.js enables developers to write server-side applications in JavaScript, making it possible to use JavaScript for both client-side and server-side programming. This is advantageous because it allows for the reuse of code and skills across different parts of a web application.
Some key features of Node.js include:
Asynchronous and event-driven: Node.js uses non-blocking, asynchronous I/O operations, which means it can handle many connections simultaneously without getting blocked. This makes it well-suited for building scalable and high-performance applications.
Single-threaded: Node.js uses a single-threaded event loop architecture, which allows it to handle many concurrent connections efficiently. It achieves concurrency by delegating I/O operations to the operating system’s kernel, freeing up the main thread to handle other tasks.
npm (Node Package Manager): npm is the default package manager for Node.js, providing a vast ecosystem of open-source libraries and tools that developers can use to build their applications.
Wide range of use cases: Node.js is commonly used for building web servers, RESTful APIs, real-time applications (such as chat applications and online gaming), streaming applications, and more.
Overall, Node.js has become a popular choice for building server-side applications due to its performance, scalability, and the ease of using JavaScript for both client-side and server-side development.
This is the structure of the file on GitHub:
This is the backend file structure:
How to run the web application on your computer?
Step 1: Prepare the Node Environment
Visit the official Node.js website at https://nodejs.org and download the installer suitable for your operating system (Windows, macOS, or Linux). Once downloaded, locate the installer file and execute it, following the on-screen instructions for installation.
To ensure that Node.js has been installed correctly, open a terminal and execute the command node -v, which will display the installed version of Node.js on your system.
Step 2: Start the Backend Server
Navigate to the /backend-main folder in a terminal session.
- Create a .env file within this directory containing the MongoDB USERNAME and PASSWORD required for database connectivity:
USERNAME=jamieluo
PASSWORD=oFWMJnNpsd9i0Gvp
- Execute the command npm install to install the project dependencies.
- Run the command npm run dev to start the backend server, establishing a connection to the MongoDB server and enabling it to listen for incoming requests from the frontend.
Step 3: Start the Frontend Server
Navigate to the /frontend-main folder in another terminal session.
- Execute the command npm install to install the project dependencies.
- Run the command npm run dev to start the frontend server.
- Note: As the frontend and backend servers typically run on different ports locally, and because browser security policies may block cross-origin requests, we use a middleware to work around this limitation. If your frontend project is not running on port 5174, adjust the port number on line 13 of the /backend-main/server.js file accordingly.
- Upon successful execution, a link will appear in your terminal indicating the URL for accessing the frontend. Click this link to interact with the website.
By following these steps, you can set up the Node.js environment, start the backend server, start the frontend server, and try out our website.
Now you can see what the main page looks like with filters:
When you click the login it will show:
If you are a new user, it will show:
And of course there's the Photoshop page you saw before.
What technology stack did we use in our web application?
Data Collection: Utilizing Python with Selenium, we scrape data from websites like AllTrails and TripAdvisor, leveraging their HTML DOM structure to extract the necessary information. This data is then stored in CSV format.
Data Analysis and Processing: After collecting the data, we analyze and process it to create structured JSON-formatted data. This processed data is crucial for further operations.
Database Integration: Using Mongoose, we seamlessly integrate the processed data into MongoDB. This allows for efficient storage and retrieval of data for future use.
Frontend Development: Employing Vue.js framework, we design a user-friendly UI/UX to deliver a streamlined experience. Users can effortlessly browse and favorite trails, while we utilize their browsing history and favorite trails’ characteristics to provide personalized recommendations.
Innovative Feature Integration: We incorporate an intriguing feature where users can upload their selfies. Leveraging deep learning techniques, we extract the human body parts from the images. Users can then select from a range of scenic images provided, and seamlessly paste their extracted images onto these backgrounds, creating personalized compositions.
Backend Implementation: Using Node.js and Express.js along with MongoDB, we architect robust APIs to serve the required data to the frontend. This backend infrastructure ensures smooth communication between the frontend and database, enabling seamless functionality for the users.
Ethical Ramifications and Concluding Remarks
We have no control over what users do with the recommendations that they receive. They could, for example, engage in malicious behavior on certain trails, or attempt to monetize the results from our tool even though we'd like it to remain freely available. Even if we include disclaimers, warnings, or agreements that users have to abide by, once a recommendation is generated, it's out of our hands.
As for biases, language processing tools may be attuned only to certain linguistic conventions, which may prioritize the results from certain reviews over others. We didn't have prior experience with NLP, so this was something we anticipated having to address as we went along. See “Risks” above.
All in all, we think we executed the technical portion of our project well. We gave users the trails most similar to their input based on reviews, and supplemented this information by merging those similar trails with the numeric data from TrailForks. We could improve by integrating this into the website more fully, so that users don't have to open our Jupyter notebooks and manually input the trails they want.
Comment Similarity Function
Now let us create a function called comment_similarity, which takes in the national_parks.csv file we just created via the park_data parameter, a comment_index parameter, and an all_comments parameter, which is our word-embedding vector representation of all comments in the csv file.
Now let us show what our returned dataframe looks like.
As we can see, we return a dataframe with the review most similar to the review at index 999. To see how similar this review is to our inputted review, let us output both comments.
First, the original comment.
Now, the similar comment.
As we can see, the comments are very similar! They both talk about the dangers of the trail, and both reviewers saw people fall.
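The core of comment_similarity is a nearest-neighbor search over the embedding vectors. A simplified sketch of that core: the real function also takes park_data and returns the matching dataframe row, while this version (with a hypothetical name, most_similar_comment) just returns the nearest index using cosine similarity, and the toy embeddings below are illustrative:

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar_comment(comment_index, all_comments):
    """Index of the comment whose embedding is closest to the query comment."""
    query = all_comments[comment_index]
    return max(
        (i for i in range(len(all_comments)) if i != comment_index),
        key=lambda i: cosine_similarity(query, all_comments[i]),
    )

# Toy embeddings: comment 1 is nearly parallel to comment 0, comment 2 is orthogonal
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(most_similar_comment(0, embeddings))  # 1
```

Excluding the query index itself matters: every vector has cosine similarity 1.0 with itself, so without the exclusion the function would always return its own input.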